Table of Contents¶

  1. Introduction
  2. Import Libraries and Load the data
  3. Text Preprocessing
  4. EDA
  5. Transforming text to a vector
  6. MultiLabel Classifier
  7. Evaluation
  8. HyperParameter Tuning
  9. Feature Importance
  10. Conclusion
  11. References

Introduction¶

The task for this project is to automatically predict the job title from job descriptions. We start from preprocessing the job descriptions and title of Job Data and then we will build a simple model to predict the Job Title.

Import Libraries and Load the data¶

In this task we will be using the following libraries:

  • Numpy — a package for scientific computing.
  • Pandas — a library providing high-performance, easy-to-use data structures and data analysis tools for the Python
  • scikit-learn — a tool for data mining and data analysis.
  • NLTK — a platform to work with natural language.
In [ ]:
import pandas as pd
from scipy.sparse import coo_matrix, vstack
from sklearn.preprocessing import MultiLabelBinarizer
import lightgbm as lgb
import scipy
import numpy as np
import nltk, re
nltk.download('stopwords') # load english stopwords
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter("ignore")
warnings.warn("deprecated", DeprecationWarning)
warnings.simplefilter("ignore")
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tajalahluwalia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
In [ ]:
 
In [ ]:
dataset = pd.read_csv('Train_rev1.csv', error_bad_lines=False, engine="python")

# dataset = dataset.sample(10000)

dataset['Category']= dataset.Category.str[0:].str.split(' ').tolist()

dataset = dataset[['FullDescription', 'Category']]
# 70-30% random split of dataset
X_train, X_test, y_train, y_test = train_test_split(dataset['FullDescription'].values, 
                                                    dataset['Category'].values, 
                                                    test_size=0.33, 
                                                    random_state=42)
In [ ]:
dataset
Out[ ]:
FullDescription Category
0 Engineering Systems Analyst Dorking Surrey Sal... [Engineering, Jobs]
1 Stress Engineer Glasgow Salary **** to **** We... [Engineering, Jobs]
2 Mathematical Modeller / Simulation Analyst / O... [Engineering, Jobs]
3 Engineering Systems Analyst / Mathematical Mod... [Engineering, Jobs]
4 Pioneer, Miser Engineering Systems Analyst Do... [Engineering, Jobs]
... ... ...
244763 Position: Qualified Teacher Subject/Specialism... [Teaching, Jobs]
244764 Position: Qualified Teacher or NQT Subject/Spe... [Teaching, Jobs]
244765 Position: Qualified Teacher Subject/Specialism... [Teaching, Jobs]
244766 Position: Qualified Teacher Subject/Specialism... [Teaching, Jobs]
244767 This entrepreneurial and growing private equit... [Teaching, Jobs]

244768 rows × 2 columns

Text Preprocessing¶

The fact that natural data is unstructured is one of the most well-known challenges when working with it. If we use it "as is" and extract tokens by splitting titles by whitespaces, we'll see that there are a lot of "strange" tokens. To avoid these issues, it's usually a good idea to prepare the data in some way.

In [ ]:
REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = list((stopwords.words('english')))

def text_prepare(text,join_sumbol):
    
    text = str(text)
    """
        text: a string
        
        return: modified initial string
    """

    # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = re.sub(REPLACE_BY_SPACE_RE," ",text,)

    # lowercase text
    text = text.lower() 

    # delete symbols which are in BAD_SYMBOLS_RE from text
    text = re.sub(BAD_SYMBOLS_RE,"",text)
    text = re.sub(r'\s+'," ",text)

    # delete stopwords from text
    text = f'{join_sumbol}'.join([i for i in text.split() if i not in STOPWORDS])
    
    return text
In [ ]:
#  test if text_prepare works
tests = ["Clean this $%$^^$ 1213 data!!"]
for test in tests: print(text_prepare(test,' '))
clean 1213 data

We can now use the function text_prepare to preprocess the data, ensuring that they do not include any invalid symbols.

In [ ]:
X_train = [text_prepare(x,' ') for x in X_train]
X_test = [text_prepare(x,' ') for x in X_test]
y_train = [text_prepare(x,' ') for x in y_train]
y_test = [text_prepare(x,' ') for x in y_test]

EDA¶

In [ ]:
from collections import Counter
from itertools import chain

# Dictionary of all tags from train corpus with their counts.
tags_counts = Counter(chain.from_iterable([i.split(" ") for i in y_train]))

# Dictionary of all words from train corpus with their counts.
words_counts = Counter(chain.from_iterable([i.split(" ") for i in X_train]))

top_3_most_common_tags = sorted(tags_counts.items(), key=lambda x: x[1], reverse=True)[:3]
top_3_most_common_words = sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:3]

print(f"Top three most popular Categories words are: {','.join(tag for tag, _ in top_3_most_common_tags)}")
print(f"Top three most popular Description words are: {','.join(tag for tag, _ in top_3_most_common_words)}")
Top three most popular Title words are: jobs,engineering,accounting
Top three most popular Description words are: experience,role,work

Transforming text to a vector¶

We can't use the provided text data "as is" since machine learning algorithms work on numeric data. Text data can be converted into numeric vectors in a variety of ways. We'll try to employ two of them in this article.

Bag-of-words¶

A bag-of-words representation is one of the most well-known techniques. Follow the steps below to accomplish this transformation:

  • Enumerate the N most popular terms in the train corpus. We now have a dictionary of the most commonly used terms.
  • Create a zero vector with a dimension of N for each title in the corpus.
  • Iterate over terms in the dictionary for each text in the corpora, increasing the relevant coordinate by one.

The described encoding will now be implemented in the function with a dictionary size of 10000. We use train data to find the most common terms.

In [ ]:
# We considered only the top 5,000 words, this parameter can be fine-tuned
DICT_SIZE = 5000
WORDS_TO_INDEX = {j[0]:i for i,j in enumerate(sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:DICT_SIZE])}
INDEX_TO_WORDS = {i:j[0] for i,j in enumerate(sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:DICT_SIZE])}
ALL_WORDS = WORDS_TO_INDEX.keys()

def my_bag_of_words(text, words_to_index, dict_size):
    """
        text: a string
        dict_size: size of the dictionary
        
        return a vector which is a bag-of-words representation of 'text'
    """
    result_vector = np.zeros(dict_size)
    keys= [words_to_index[i] for i in text.split(" ") if i in words_to_index.keys()]
    result_vector[keys]=1
    return result_vector

We apply the implemented function to all samples at this point. To store the useful information efficiently, we change the data to a sparse representation. There are many different sorts of such representations, but sklearn algorithms can only deal with the csr matrix, so we'll use that.

In [ ]:
X_train_mybag = scipy.sparse.vstack([scipy.sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_train])
X_test_mybag = scipy.sparse.vstack([scipy.sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_test])
print('X_train shape ', X_train_mybag.shape)
print('X_test shape ', X_test_mybag.shape)
X_train shape  (163994, 10000)
X_test shape  (80774, 10000)

TF-IDF¶

The second method builds on the bag-of-words framework by accounting for the total frequency of words in the corpora. It aids in penalizing overuse of words and providing more features space.

To train a vectorizer, we utilize scikit-TfidfVectorizer learn's and our train corpus. Don't forget to investigate the arguments you can send to it. Too few words (less than 5) and too many words are filtered away (occur more than in 90 percent of the titles). Use bigrams and unigrams in our vocabulary as well.

In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_features(X_train, X_test):
    """
        X_train, X_val, X_test — samples        
        return bag-of-words representation of each sample and vocabulary
    """
    # Create TF-IDF vectorizer with a proper parameters choice
    # Fit the vectorizer on the train set
    # Transform the train, test, and val sets and return the result
    
    
    tfidf_vectorizer = TfidfVectorizer(X_train,ngram_range=(1,2),max_df=0.9,min_df=5,token_pattern=r'(\S+)' )
    tfidf_vectorizer.fit(X_train)
    X_train = tfidf_vectorizer.transform(X_train)
    X_test = tfidf_vectorizer.transform(X_test)
    
    return X_train, X_test, tfidf_vectorizer.vocabulary_

X_train_tfidf, X_test_tfidf, tfidf_vocab = tfidf_features(X_train, X_test)
tfidf_reversed_vocab = {i:word for word,i in tfidf_vocab.items()}
In [ ]:
print("manager" in set(tfidf_reversed_vocab.values()))
print("engineer" in set(tfidf_reversed_vocab.values()))
True
True

MultiLabel Classifier¶

As we've seen before, each sample in this exercise can contain multiple terms in the title. We must convert labels to binary form, with the prediction being a mask of 0s and 1s. MultiLabelBinarizer from sklearn is useful for this.

Before providing each element of labels to the MultiLabelBinarizer, we must first convert it to a dictionary.

In [ ]:
# transform to dictionary
y_train = [set(i.split(' ')) for i in y_train]
y_test = [set(i.split(' ')) for i in y_test]
In [ ]:
y_train[12]
Out[ ]:
{'accounting', 'finance', 'jobs'}
In [ ]:
# fit the transformer
mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(y_train)
y_test = mlb.fit_transform(y_test)
In [ ]:
y_train[0]
Out[ ]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0])
In [ ]:
mlb.classes_
Out[ ]:
array(['accounting', 'admin', 'advertising', 'catering', 'charity',
       'cleaning', 'construction', 'consultancy', 'creative', 'customer',
       'design', 'domestic', 'energy', 'engineering', 'finance', 'gas',
       'general', 'graduate', 'healthcare', 'help', 'hospitality', 'hr',
       'jobs', 'legal', 'logistics', 'maintenance', 'manufacturing',
       'marketing', 'nursing', 'oil', 'part', 'pr', 'property', 'qa',
       'recruitment', 'retail', 'sales', 'scientific', 'services',
       'social', 'teaching', 'time', 'trade', 'travel', 'voluntary',
       'warehouse', 'work'], dtype=object)

We recommend adopting the One-vs-Rest technique in this task, which is implemented in the OneVsRestClassifier class. k classifiers (= number of tags) are trained in this method.

It is one of the most basic strategies, but it often suffices in text categorization tasks.

Because there are so many classifiers to train, it may take some time.

In [ ]:
# For multiclass classification
from sklearn.multiclass import OneVsRestClassifier

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from lightgbm import LGBMClassifier

def train_classifier(X_train, y_train, X_valid=None, y_valid=None, C=1.0, model='lr'):
    """
      X_train, y_train — training data
      
      return: trained classifier
      
    """
    
    if model=='lr':
        model = LogisticRegression(C=C, penalty='l1', dual=False, solver='liblinear')
        model = OneVsRestClassifier(model)
        model.fit(X_train, y_train)
    
    elif model=='svm':
        model = LinearSVC(C=C, penalty='l1', dual=False, loss='squared_hinge')
        model = OneVsRestClassifier(model)
        model.fit(X_train, y_train)
    
    elif model=='nbayes':
        model = MultinomialNB(alpha=1.0)
        model = OneVsRestClassifier(model)
        model.fit(X_train, y_train)
        
    elif model=='lda':
        model = LinearDiscriminantAnalysis(solver='svd')
        model = OneVsRestClassifier(model)
        model.fit(X_train, y_train)

    return model

# Train the classifiers for different data transformations: bag-of-words and tf-idf.

# Linear NLP model using bag of words approach
%time classifier_mybag = train_classifier(X_train_mybag, y_train, C=1.0, model='lr')

# Linear NLP model using TF-IDF approach
%time classifier_tfidf = train_classifier(X_train_tfidf, y_train, C=1.0, model='lr')
CPU times: user 7min 51s, sys: 9 s, total: 8min
Wall time: 8min 2s
CPU times: user 5min 28s, sys: 22.6 s, total: 5min 51s
Wall time: 5min 56s

Create predictions for the data.¶

In [ ]:
y_test_predicted_labels_mybag = classifier_mybag.predict(X_test_mybag)

y_test_predicted_labels_tfidf = classifier_tfidf.predict(X_test_tfidf)
In [ ]:
y_test_pred_inversed = mlb.inverse_transform(y_test_predicted_labels_tfidf)
y_test_inversed = mlb.inverse_transform(y_test)
for i in range(3):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test[i],
        ','.join(y_test_inversed[i]),
        ','.join(y_test_pred_inversed[i])
    ))
Title:	business account manager plumbing heating products basic salary south west company currently recruiting due internal promotion wellknown construction group history stretching back turnover excess 300 million selling plumbing heating products plumbing heating contractors dealing every sector smes 1man bands right larger contractors role even split account management new business development inherit excellent accounts sales career opportunities within group excellent person following skills field sales track record sold construction industrial product seek someone excellent organisational account development skills receive full product training package basic salary 30 bonuses fully expensed company car mobile pension laptop south west wales avon bristol bath somerset dorset gloucestershire cornwall wiltshire glamorgan carmarthenshire cardiff swansea bms leading consultancy specialising sales recruitment established 1990 bms achieved truly nationwide presence number regional centres south west operation established 1999 introduced service needs candidates clients alike throughout south west wales offering sales jobs trainees sales representatives sales executives sales engineers area sales managers territory managers account managers opportunities available every corner uk initial meetings occur convenient location bristol m4 m32 within easy reach m5 committed meeting potentially suitable candidates face face furthermore organisation consists several highly focused teams aimed specific market sectors enabling us deliver service directly tailored needs please take time search website wwwbmsukcom sales alternatively contact tina vine job originally posted wwwtotaljobscom jobseeking businessaccountmanager_job
True labels:	jobs,sales
Predicted labels:	jobs,sales


Title:	job title staff nurse rgn rmn nightslocation newton abbeysalary per hourhours part time 22 hours per weekskills nmc registration nursing home rgn rmn staff nurse registered general nurse registered mental health nurse mental health old agejob reference rgnregional recruitment services currently recruiting staff nurse rgn rmn work within medium sized nursing home based newton abbey area role provide high standard nursing care clients suffering old age mental health physical disabilitiesto ensure compliance cqc standards guidelinesto write implement set care plansto administer prescribed medicationto conduct accurate risk assessments ensure information recorded correctly ensure smooth running home night shifts oversee unqualified members staff whilst shift candidate must registered general nurse registered mental health nursemust current nmc pin numbermust able work nights part time basismust previous experience working within nursing home setting must previous experience working clients suffer old age mental health physical disabilities must passion caring others package competitive salary generous package excellent benefits befits one uks prestigious organisations highly competitive holiday package selection benefits excellent working environment promotion opportunities high levels job security great career pathway combined clinical skills development secure supportive working environmentto considered opportunity please apply directly website send cv us us directly alexhowarthregionalrecruitmentcom would like speak us detail applying please call alex howarth danielle fyfe quoting reference rgn position advertised behalf regional recruitment services ltd also variety permanent positions available ranging care assistants care home managers care jobs charge nurse jobs childrens nurses clinical lead nurses clinical nurses clinical nurse specialists community childrens nurses community mental health community sisters community staff nurses community workers deputy care managers deputy ward managers district nurses team leaders emergency nurse posts hdu nurse positions health care assistants home manager jobs icu nurse lead nurse midwife modern matron neonatal staff nurse jobs nurse advisors nurse team leader posts nursing auxiliary nvq assessor occupational health nurses occupational therapists oncology nurses paediatric nurses practice nurses recovery nurses registered general nurse posts registered nurse posts residential adult care jobs residential child care jobs rgns rmns rnlds school nurses scrub nurses senior sisters social care posts social worker positions staff nurse e support workers theatre manager posts theatre nurses theatre practitioners theatre support workers ward managers ward sister posts
True labels:	healthcare,jobs,nursing
Predicted labels:	healthcare,jobs,nursing


Title:	dynamic international development charity recruiting community fundraising manager play key part delivery charitys fundraising strategy take lead developing community fundraising programme including expanding appeal programme charitys volunteer network key day day duties within position include work closely direct marketing manager deliver fundraising appeal schools churches including undertaking evaluation previous appeals developing plan increase income community dm activity take lead developing high value support schools churches community groups develop grow friends network including building effective relationships existing groups individual community fundraisers recruiting new supporters increasing overall support important ambassadors develop charitys speaker network contribute development community fundraising strategy undertake review analysis past activity identify areas potential growth identify new business opportunities community fundraising demonstrate excellent relationship management key community groups individuals coordinate supervise thank reactivation recruitment calling process undertaken volunteer including evaluating impact calls make manage community fundraising budget effectively monitoring performance budget monthly basis contributing regular reforecasting line management teams fundraising assistant successful applicant following skills experience educated degree level equivalent significant demonstrable experience working community fundraising role experience managing mass communications direct mail appeals experience effectively building managing relationships range groups individuals experience working income targets managing expenditure budgets experience working schools faith groups volunteers community groups experience managing volunteers experience giving presentations representing organisation range events proven excellent project management skills proven excellent relationship management skills excellent written communication skills including excellent attention detail confident effective verbal communication skills closing date 28th january 2013 interested role wish register tpp hear future posts please send cv fundraisingtppcouk try get touch applications interest however due volume applications receive isnt always possible plus free training fundraisers fundraisers secure role tpp receive cpd voucher use institute fundraising details available tpp profit website http wwwtppcouk cpdvoucher
True labels:	charity,jobs,voluntary
Predicted labels:	charity,jobs,voluntary


In [ ]:
 
In [ ]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score
In [ ]:
from functools import partial
def print_evaluation_scores(y_val, predicted):
    f1_score_macro = partial(f1_score,average="macro")
    f1_score_micro = partial(f1_score,average="micro")
    f1_score_weighted = partial(f1_score,average="weighted")
    
    average_precision_score_macro = partial(average_precision_score,average="macro")
    average_precision_score_micro = partial(average_precision_score,average="micro")
    average_precision_score_weighted = partial(average_precision_score,average="weighted")
    
    scores = [accuracy_score,f1_score_macro,f1_score_micro,f1_score_weighted,average_precision_score_macro,
             average_precision_score_micro,average_precision_score_weighted]
    for score in scores:
        print(score,score(y_val,predicted))
In [ ]:
print('Bag-of-words')
print_evaluation_scores(y_test, y_test_predicted_labels_mybag)
print('Tfidf')
print_evaluation_scores(y_test, y_test_predicted_labels_tfidf)
Bag-of-words
<function accuracy_score at 0x7fef9a4a9d30> 0.6055909079654345
functools.partial(<function f1_score at 0x7fef9a576550>, average='macro') 0.5892786511201468
functools.partial(<function f1_score at 0x7fef9a576550>, average='micro') 0.821501106257446
functools.partial(<function f1_score at 0x7fef9a576550>, average='weighted') 0.8052453113207757
functools.partial(<function average_precision_score at 0x7fef9a4a05e0>, average='macro') 0.40757959686158407
functools.partial(<function average_precision_score at 0x7fef9a4a05e0>, average='micro') 0.6880770264936487
functools.partial(<function average_precision_score at 0x7fef9a4a05e0>, average='weighted') 0.7078709573392094
Tfidf
<function accuracy_score at 0x7fef9a4a9d30> 0.5813380543244113
functools.partial(<function f1_score at 0x7fef9a576550>, average='macro') 0.5167656028586087
functools.partial(<function f1_score at 0x7fef9a576550>, average='micro') 0.8147967431880309
functools.partial(<function f1_score at 0x7fef9a576550>, average='weighted') 0.7844110301221876
functools.partial(<function average_precision_score at 0x7fef9a4a05e0>, average='macro') 0.35677602112734996
functools.partial(<function average_precision_score at 0x7fef9a4a05e0>, average='micro') 0.6826627731365503
functools.partial(<function average_precision_score at 0x7fef9a4a05e0>, average='weighted') 0.6938637800525888

Tuning Hyperparameters¶

We'll utilize F1-score weighted as an evaluation measure, and we'll experiment with L1 and L2-regularization procedures in Logistic Regression with various coefficients (e.g. C equal to 0.1, 1, 10, 100).

In [ ]:
import matplotlib.pyplot as plt

hypers = np.arange(0.1, 1.1, 0.1)
res = []

for h in hypers:
    temp_model = train_classifier(X_train_tfidf, y_train, C=h, model='lr')
    temp_pred = f1_score(y_test, temp_model.predict(X_test_tfidf), average='weighted')
    res.append(temp_pred)

plt.figure(figsize=(7,5))
plt.plot(hypers, res, color='blue', marker='o')
plt.grid(True)
plt.xlabel('Parameter $C$')
plt.ylabel('Weighted F1 score')
plt.show()
In [ ]:
# Final model
C = 1.0
classifier = train_classifier(X_train_tfidf, y_train, C=C, model='lr')

# Results
test_predictions =  classifier.predict(X_test_tfidf)
test_pred_inversed = mlb.inverse_transform(test_predictions)
In [ ]:
def print_words_for_tag(classifier, tag, tags_classes, index_to_words, all_words):
    """
        classifier: trained classifier
        tag: particular tag
        tags_classes: a list of classes names from MultiLabelBinarizer
        index_to_words: index_to_words transformation
        all_words: all words in the dictionary
        
        return nothing, just print top 5 positive and top 5 negative words for current tag
    """
    print('Tag:\t{}'.format(tag))
    
    tag_n = np.where(tags_classes==tag)[0][0]
    
    model = classifier.estimators_[tag_n]
    top_positive_words = [index_to_words[x] for x in model.coef_.argsort().tolist()[0][-8:]]
    top_negative_words = [index_to_words[x] for x in model.coef_.argsort().tolist()[0][:8]]
    
    print('Top positive words:\t{}'.format(', '.join(top_positive_words)))
    print('Top negative words:\t{}\n'.format(', '.join(top_negative_words)))
In [ ]:
mlb.classes_
Out[ ]:
array(['accounting', 'admin', 'advertising', 'catering', 'charity',
       'cleaning', 'construction', 'consultancy', 'creative', 'customer',
       'design', 'domestic', 'energy', 'engineering', 'finance', 'gas',
       'general', 'graduate', 'healthcare', 'help', 'hospitality', 'hr',
       'jobs', 'legal', 'logistics', 'maintenance', 'manufacturing',
       'marketing', 'nursing', 'oil', 'part', 'pr', 'property', 'qa',
       'recruitment', 'retail', 'sales', 'scientific', 'services',
       'social', 'teaching', 'time', 'trade', 'travel', 'voluntary',
       'warehouse', 'work'], dtype=object)
In [ ]:
print_words_for_tag(classifier, 'engineering', mlb.classes_, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier, 'healthcare', mlb.classes_, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier, 'sales', mlb.classes_, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier, 'scientific', mlb.classes_, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier, 'construction', mlb.classes_, tfidf_reversed_vocab, ALL_WORDS)
Tag:	engineering
Top positive words:	calco, position candidates, wwwtotaljobscom, alecto, technologies ltd, introduction, pound k, uk skills
Top negative words:	contact recruiter, consultancy job, wwwcwjobscouk, posted wwwcareerstructurecom, wwwcareerstructurecom, jobseeking, posted wwwcwjobscouk, care

Tag:	healthcare
Top positive words:	territory, radiographer, goc, compass associates, optometrist, agency advertises, cares job, employer details
Top negative words:	handson nursing, developer, engineer, firm, ever need, retail, bonuses loyalty, reference jo

Tag:	sales
Top positive words:	repairs capital, posted wwwtotaljobscom, agency defined, following criteriaeducated, equivalentsmichael, wwwsalestargetcouk jobseeking, posted wwwsalestargetcouk, wwwsalestargetcouk
Top negative words:	wwwcwjobscouk, bms leading, wwwretailchoicecom, following criteria, wwwcaterercom, qualified, engineer, chef

Tag:	scientific
Top positive words:	agency employment, science, allied recruitment, field hays, hearing aid, scientific, populus, team24
Top negative words:	apply online, removed, school, engineer, high, developer, financial, social

Tag:	construction
Top positive words:	cscs, cpcs, vertu, energy talent, wwwcareerstructurecom, wwwcareerstructurecom jobseeking, twittercom motortradejobs, posted wwwcareerstructurecom
Top negative words:	uk skills, technology, children, manager sales, care, amp, server, calco

In [ ]:
 

XAI¶

In [ ]:
# !pip3 install lightgbm
In [ ]:
import pandas as pd
from scipy.sparse import coo_matrix, vstack
from sklearn.preprocessing import MultiLabelBinarizer
import lightgbm as lgb
import scipy
import numpy as np
import nltk, re
nltk.download('stopwords') # load english stopwords
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter("ignore")
warnings.warn("deprecated", DeprecationWarning)
warnings.simplefilter("ignore")
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tajalahluwalia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
In [ ]:
dataset = pd.read_csv('Train_rev1.csv', error_bad_lines=False, engine="python")

dataset = dataset[dataset.Category.isin(dataset.Category.value_counts().index.tolist()[1:5])]

# dataset['Category']= dataset.Category.str[0:].str.split(' ').tolist()
In [ ]:
dataset['Category'].value_counts()
Out[ ]:
Engineering Jobs             25174
Accounting & Finance Jobs    21846
Healthcare & Nursing Jobs    21076
Sales Jobs                   17272
Name: Category, dtype: int64
In [ ]:
dataset = dataset[['FullDescription', 'Category']]
# 70-30% random split of dataset
X_train, X_test, y_train, y_test = train_test_split(dataset['FullDescription'].values, 
                                                    dataset['Category'].values, 
                                                    stratify= dataset['Category'].values,
                                                    test_size=0.33, 
                                                    random_state=420)

REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = list((stopwords.words('english')))

def text_prepare(text,join_sumbol):
    
    text = str(text)
    """
        text: a string
        
        return: modified initial string!
    """

    # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = re.sub(REPLACE_BY_SPACE_RE," ",text,)

    # lowercase text
    text = text.lower() 

    # delete symbols which are in BAD_SYMBOLS_RE from text
    text = re.sub(BAD_SYMBOLS_RE,"",text)
    text = re.sub(r'\s+'," ",text)

    # delete stopwords from text
    text = f'{join_sumbol}'.join([i for i in text.split() if i not in STOPWORDS])
    
    return text


X_train = [text_prepare(x,' ') for x in X_train]
X_test = [text_prepare(x,' ') for x in X_test]
# y_train = [text_prepare(x,' ') for x in y_train]
# y_test = [text_prepare(x,' ') for x in y_test]
In [ ]:
idx = 922

display(y_train[idx])
display(X_train[idx])
'Engineering Jobs'
'immediate electrical engineers hvac electricians needed seeking immediately available electrical engineers work docklands primarily working street lighting lighting faults testing inspection works occasional working pump circuit controls well equipment exterior buildings grounds must 17th edition testing inspection certificate working earlies lates shift pattern wwwprsjobscom job originally posted wwwtotaljobscom jobseeking electricalengineer_job'
In [ ]:
# !pip3 install lime

import lime
import sklearn
import numpy as np
import sklearn
import sklearn.ensemble
import sklearn.metrics
from __future__ import print_function
In [ ]:
class_names = y_train

# let's use the tfidf vectorizer, commonly used for text.

vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)
In [ ]:
# Use Multinomial Naive Bayes for classification, so that we can make reference to this document.

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB(alpha=.01)
nb.fit(train_vectors, y_train)
Out[ ]:
MultinomialNB(alpha=0.01)
In [ ]:
pred = nb.predict(test_vectors)
sklearn.metrics.f1_score(y_test, pred, average='weighted')
Out[ ]:
0.9333758658310932

Explaining predictions using lime¶¶

In [ ]:
from lime import lime_text
from sklearn.pipeline import make_pipeline

c = make_pipeline(vectorizer, nb)

print(c.predict_proba([X_train[idx]]).round(3))
print(y_train[idx])
[[0. 1. 0. 0.]]
Engineering Jobs
In [ ]:
from lime.lime_text import LimeTextExplainer
explainer = LimeTextExplainer(class_names=class_names)
In [ ]:
exp = explainer.explain_instance(X_train[idx], c.predict_proba, num_features=10, top_labels=5)
print(exp.available_labels())
[1, 3, 2, 0]
In [ ]:
print ('Explanation for class %s' % class_names[1])
print ('\n'.join(map(str, exp.as_list(label=0))))
print ()
print ('Explanation for class %s' % class_names[0])
print ('\n'.join(map(str, exp.as_list(label=2))))
Explanation for class Engineering Jobs
('circuit', -0.0005055037273207403)
('inspection', -0.00048667664420182634)
('electricalengineer_job', -0.0004837421730272294)
('pump', -0.00047322764056142227)
('electrical', -0.0004427655429782297)
('lighting', -0.00043357716891999305)
('edition', -0.000430327198661014)
('earlies', -0.0004115406234869048)
('17th', -0.0004003031555053982)
('engineers', -0.00039012163726526915)

Explanation for class Healthcare & Nursing Jobs
('electricians', -0.0013826400350452404)
('electricalengineer_job', -0.0013042528695080334)
('exterior', -0.0012860932130299943)
('lighting', -0.0012311863798456547)
('engineers', -0.0012120915293569073)
('electrical', -0.001177728219577556)
('circuit', -0.0010346291740168962)
('edition', -0.0010129480165906883)
('wwwprsjobscom', -0.0009745095367927377)
('earlies', 0.0007409230943452269)
In [ ]:
exp.show_in_notebook(text=X_train[idx], labels=(exp.available_labels()))
In [ ]:
 
In [ ]: